Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

function: Allow more expressive array signatures #14532

Merged
merged 16 commits into from
Feb 14, 2025

Conversation

jkosh44
Copy link
Contributor

@jkosh44 jkosh44 commented Feb 6, 2025

This commit allows for more expressive array function signatures. Previously, ArrayFunctionSignature was an enum of potential argument combinations and orders. For many array functions, none of the ArrayFunctionSignature variants work, so they use TypeSignature::VariadicAny instead. This commit will allow those functions to use more descriptive signatures which will prevent them from having to perform manual type checking in the function implementation.

As an example, this commit also updates the signature of the array_replace family of functions to use a new expressive signature, which removes a panic that existed previously.

Works towards resolving #14451

Which issue does this PR close?

Works towards closing, but doesn't fully close, #14451

Are these changes tested?

Yes

Are there any user-facing changes?

No, other than removing some panics.

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Feb 6, 2025
@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 7, 2025

This is failing CI for the following reason.

Previously, in get_valid_types() functions with the array signature ArrayAndIndexes and Array would convert top level FixedSizeLists to Lists; nested FixedSizeLists would not be changed. However, functions with the array signature ArrayAndElement, ElementAndArray, and ArrayAndElementAndOptionalIndex would recursively convert FixedSizeLists to Lists.

So for example, if we had the type FixedSizeList(FixedSizeList(Int64)), ArrayAndIndexes would convert this to List(FixedSizeList(Int64)) but ArrayAndElement would convert this to List(List(Int64)).

This PR modifies the logic of get_valid_types() so that we always recursively convert FixedSizeLists to Lists. So FixedSizeList(FixedSizeList(Int64)) is always converted to List(List(Int64)).

There's a test, test_array_element_return_type_fixed_size_list(), that essentially asserts that get_valid_types() returns List(FixedSizedList(Int32)) for the array_element function, which used to have the ArrayAndIndexes signature. This test now fails because the FixedSizedList is converted to a List recursively.

I don't understand the old behavior, and why it was different for different function signatures. So I'm having trouble figuring out what the new behavior should be. We could of course sniff out the functions arguments to see if we have [array, index+] and if so only replace top level FixedSizedList with List, but I wouldn't understand why that's correct.

@jkosh44 jkosh44 marked this pull request as ready for review February 7, 2025 02:59
@jayzhan211
Copy link
Contributor

See this, #13819 (comment).
We only convert an inner fixed-size list to a regular list when the function performs a mutation operation, such as array_append. For non-mutation operations, like array_element, we keep T as a fixed-size list

Comment on lines 230 to 236
Array {
/// A full list of the arguments accepted by this function.
arguments: Vec<ArrayFunctionArgument>,
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One drawback of this structure is that someone can technically create a signature with no array, only for it to get rejected at runtime in get_valid_types(). An alternative structure could be:

/// A list of arguments that come before the array.
pre_array_aguments: Vec<ArrayFunctionArgument>,
/// A list of arguments that come after the array.
post_array_aguments: Vec<ArrayFunctionArgument>,

Then we would remove the ArrayFunctionArgument::Array variant. This would have the draw back of only allowing a single array argument in the signature, which might be fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to force people to create this through a constructor that returns an error if there's no array argument.

Comment on lines +269 to +273
/// An Int64 index argument.
Index,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offset might be a better name for this? It can technically be used for sizes in functions like array_resize or counts for functions like array_replace_n.

@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 7, 2025

See this, #13819 (comment). We only convert an inner fixed-size list to a regular list when the function performs a mutation operation, such as array_append. For non-mutation operations, like array_element, we keep T as a fixed-size list

Ah ok, so the idea was that if the function changed the size of the list, then it would recursively convert FixedSizedList to List, but if it didn't change the size then it would only change the top level FixedSizedList to a List. However, it looks like the code was making the assumption that all ArrayAndIndexes functions did not modify the list which was actually no longer true after 3dfce7d.

So it sounds like we need to include in the array signature whether or not the function might change the size of the list and use that information.

@jayzhan211
Copy link
Contributor

So it sounds like we need to include in the array signature whether or not the function might change the size of the list and use that information.

yes

@github-actions github-actions bot added the common Related to common crate label Feb 7, 2025
@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 10, 2025

It might be a better approach to not modify the accepted argument types (i.e. don't convert FixedSizeList to List in get_valid_types(), and instead move the logic to return_type(). Then functions can be explicit about not returning FixedSizeLists.

I pushed a commit for this to see how CI would like it. Happy to revert it if people don't like it.

I also pushed a commit to add some validations around ArrayFunctionSignature.

@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 11, 2025

@jayzhan211 I just pushed a commit with your idea about adding a flag for array coercion. I'm feeling pretty good about the current state of this PR, but I have the following two open questions:

  • get_valid_types() always coerces the outermost FixedSizeList to List, no matter what the value of the flag is. I'm pretty sure that matches existing behavior, but I don't fully understand it. So I just wanted to double check with you that's it's correct.
  • Can you double check that I correctly assigned a function with Some(ListCoercion::FixedSizedListToList) vs None for the flag? I'm not 100% confident that I did it correctly.

@jayzhan211
Copy link
Contributor

get_valid_types() always coerces the outermost FixedSizeList to List, no matter what the value of the flag is. I'm pretty sure that matches existing behavior, but I don't fully understand it. So I just wanted to double check with you that's it's correct

It is because of this, I think we now only coerce to list if the flag is set

    fn array(array_type: &DataType) -> Option<DataType> {
        match array_type {
            DataType::List(_) | DataType::LargeList(_) => Some(array_type.clone()),
            DataType::FixedSizeList(field, _) => Some(DataType::List(Arc::clone(field))),
            _ => None,
        }
    }

pub fn new(arguments: Vec<ArrayFunctionArgument>) -> Result<Self, &'static str> {
if !arguments
.iter()
.any(|arg| *arg == ArrayFunctionArgument::Array)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of checking here, I think we can verify the validity in get_valid_types, so the definition of signature can be simplified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's actually what I originally had. The benefit of the current approach is that the error happens during startup and is very clear. Otherwise the error doesn't happen until the function is first executed and the error is not as informative. Additionally, it forces people to to validate the invariant that ArrayFunctionSignature::Array contains at least one array when initializing the struct because they're forced to call unwrap() or expect().

On startup:

DataFusion CLI v45.0.0
thread 'main' panicked at datafusion/expr-common/src/signature.rs:695:22:
contains array: "missing array argument"
stack backtrace:
   0: rust_begin_unwind
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/std/src/panicking.rs:665:5
   1: core::panicking::panic_fmt
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/core/src/panicking.rs:76:14
   2: core::result::unwrap_failed
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/core/src/result.rs:1699:5
   3: core::result::Result<T,E>::expect
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/result.rs:1061:23
   4: datafusion_expr_common::signature::Signature::array_and_index
             at /home/joe/Projects/datafusion/datafusion/expr-common/src/signature.rs:691:32
   5: datafusion_functions_nested::extract::ArrayElement::new
             at /home/joe/Projects/datafusion/datafusion/functions-nested/src/extract.rs:121:24
   6: datafusion_functions_nested::extract::array_element_udf::INSTANCE::{{closure}}
             at /home/joe/Projects/datafusion/datafusion/functions-nested/src/macros.rs:95:29
   7: core::ops::function::FnOnce::call_once
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
   8: core::ops::function::FnOnce::call_once
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
   9: std::sync::lazy_lock::LazyLock<T,F>::force::{{closure}}
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/lazy_lock.rs:212:25
  10: std::sync::once::Once::call_once::{{closure}}
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/once.rs:158:41
  11: std::sys::sync::once::futex::Once::call
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/std/src/sys/sync/once/futex.rs:176:21
  12: std::sync::once::Once::call_once
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/once.rs:158:9
  13: std::sync::lazy_lock::LazyLock<T,F>::force
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/lazy_lock.rs:208:9
  14: <std::sync::lazy_lock::LazyLock<T,F> as core::ops::deref::Deref>::deref
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/lazy_lock.rs:311:9
  15: datafusion_functions_nested::extract::array_element_udf
             at /home/joe/Projects/datafusion/datafusion/functions-nested/src/macros.rs:98:39
  16: datafusion_functions_nested::all_default_nested_functions
             at /home/joe/Projects/datafusion/datafusion/functions-nested/src/lib.rs:129:9
  17: datafusion::execution::session_state_defaults::SessionStateDefaults::default_scalar_functions
             at /home/joe/Projects/datafusion/datafusion/core/src/execution/session_state_defaults.rs:108:31
  18: datafusion::execution::session_state::SessionStateBuilder::with_default_features
             at /home/joe/Projects/datafusion/datafusion/core/src/execution/session_state.rs:1088:36
  19: datafusion::execution::context::SessionContext::new_with_config_rt
             at /home/joe/Projects/datafusion/datafusion/core/src/execution/context/mod.rs:330:21
  20: datafusion_cli::main_inner::{{closure}}
             at ./src/main.rs:173:15
  21: datafusion_cli::main::{{closure}}
             at ./src/main.rs:131:34
  22: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/future.rs:124:9
  23: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/park.rs:284:63
  24: tokio::runtime::coop::with_budget
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/coop.rs:107:5
  25: tokio::runtime::coop::budget
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/coop.rs:73:5
  26: tokio::runtime::park::CachedParkThread::block_on
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/park.rs:284:31
  27: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/context/blocking.rs:66:9
  28: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
  29: tokio::runtime::context::runtime::enter_runtime
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/context/runtime.rs:65:16
  30: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
  31: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/runtime.rs:370:45
  32: tokio::runtime::runtime::Runtime::block_on
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/runtime.rs:340:13
  33: datafusion_cli::main
             at ./src/main.rs:131:5
  34: core::ops::function::FnOnce::call_once
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Process finished with exit code 101

During function execution:

> SELECT array_element(1);
Error during planning: Internal error: Function 'array_element' expected at least one argument array argument.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker No function matches the given name and argument types 'array_element(Int64)'. You might need to add explicit type casts.
        Candidate functions:
        array_element(index)

I've removed ArrayFunctionArguments, but still wanted to explain it's rationale in case it wasn't obvious.

@@ -94,7 +94,7 @@ impl Default for ArrayHas {
impl ArrayHas {
pub fn new() -> Self {
Self {
signature: Signature::array_and_element(Volatility::Immutable),
signature: Signature::array_and_element(Volatility::Immutable, None),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like this, we don't necessary need to wrap into a util function, it is less helpful when there are many fields

            signature: Signature {
                type_signature: TypeSignature::ArraySignature(
                    ArrayFunctionSignature::Array {
                        arguments: vec![
                            ArrayFunctionArgument::Array,
                            ArrayFunctionArgument::Element,
                        ],
                        array_coercion: Some(ListCoercion::FixedSizedListToList),
                    },
                ),
                volatility: Volatility::Immutable,
            },

Copy link
Contributor

@jayzhan211 jayzhan211 Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but to avoid breaking Signature::array_and_element we can keep one without additional array_coercion field

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I don't think I fully understood what you're suggesting. I pushed d404c5e to try and address this, please let me know if that's what you had in mind.

@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 12, 2025

It is because of this, I think we now only coerce to list if the flag is set

Are you saying that the function should look something like this?

fn array(
        array_type: &DataType,
        array_coercion: Option<&ListCoercion>,
    ) -> Option<DataType> {
        match (array_type, array_coercion) {
            (
                DataType::FixedSizeList(field, _),
                Some(ListCoercion::FixedSizedListToList),
            ) => Some(DataType::List(Arc::clone(field))),
            (
                DataType::List(_)
                | DataType::LargeList(_)
                | DataType::FixedSizeList(_, _),
                _,
            ) => Some(array_type.clone()),
            _ => None,
        }
    }

Doing that causes some tests in array.slt to fail and I'm not really sure why. It also doesn't match the previous behavior.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 13, 2025

It is because of this, I think we now only coerce to list if the flag is set

Are you saying that the function should look something like this?

fn array(
        array_type: &DataType,
        array_coercion: Option<&ListCoercion>,
    ) -> Option<DataType> {
        match (array_type, array_coercion) {
            (
                DataType::FixedSizeList(field, _),
                Some(ListCoercion::FixedSizedListToList),
            ) => Some(DataType::List(Arc::clone(field))),
            (
                DataType::List(_)
                | DataType::LargeList(_)
                | DataType::FixedSizeList(_, _),
                _,
            ) => Some(array_type.clone()),
            _ => None,
        }
    }

Doing that causes some tests in array.slt to fail and I'm not really sure why. It also doesn't match the previous behavior.

It might because of array_coercion you set for the function is incorrect 🤔

@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 13, 2025

It might because of array_coercion you set for the function is incorrect 🤔

The query that fails is array.slt:1221 which is

query IT
select array_element(arrow_cast(make_array(1, 2, 3, 4, 5), 'FixedSizeList(5, Int64)'), 2), array_element(arrow_cast(make_array('h', 'e', 'l', 'l', 'o'), 'FixedSizeList(5, Utf8)'), 3);
----
2 l

The error is

External error: query failed: DataFusion error: Execution error: array_element does not support type: FixedSizeList(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 5)

The reason is that array_element explicitly does not support FixedSizeList, and my guess is that many other functions don't as well.

/// array_element SQL function
///
/// There are two arguments for array_element, the first one is the array, the second one is the 1-indexed index.
/// `array_element(array, index)`
///
/// For example:
/// > array_element(\[1, 2, 3], 2) -> 2
fn array_element_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
let [array, indexes] = take_function_args("array_element", args)?;
match &array.data_type() {
List(_) => {
let array = as_list_array(&array)?;
let indexes = as_int64_array(&indexes)?;
general_array_element::<i32>(array, indexes)
}
LargeList(_) => {
let array = as_large_list_array(&array)?;
let indexes = as_int64_array(&indexes)?;
general_array_element::<i64>(array, indexes)
}
_ => exec_err!(
"array_element does not support type: {:?}",
array.data_type()
),
}
}

We can't set array_coercion to Some(ListCoercion::FixedSizedListToList), because we have tests that explicitly test that the inner FixedSizedList is not coerced to a List.

Just to re-iterate, it's the existing behavior that ALL array functions coerce the outermost FixedSizeList to List. The array() function is already there in main. My guess is because most function implementations are not set up to correctly handle FixedSizeList, so it's coerced to a List.

This commit allows for more expressive array function signatures.
Previously, `ArrayFunctionSignature` was an enum of potential argument
combinations and orders. For many array functions, none of the
`ArrayFunctionSignature` variants worked, so they used
`TypeSignature::VariadicAny` instead. This commit will allow those
functions to use more descriptive signatures which will prevent them
from having to perform manual type checking in the function
implementation.

As an example, this commit also updates the signature of the
`array_replace` family of functions to use a new expressive signature,
which removes a panic that existed previously.

There are still a couple of limitations with this approach. First of
all, there's no way to describe a function that has multiple different
arrays of different type or dimension. Additionally, there isn't
support for functions with map arrays and recursive arrays that have
more than one argument.

Works towards resolving apache#14451
@jayzhan211
Copy link
Contributor

array_element doesn't change the list so not coerce to list makes sense to me.

@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 13, 2025

Thanks for the review @jayzhan211 and for all of your helpful feedback!

@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 13, 2025

I think this might also be an API change? I don't have the permissions to add the tag though.

@jayzhan211 jayzhan211 added the api change Changes the API exposed to users of the crate label Feb 14, 2025
@jayzhan211 jayzhan211 merged commit 469f18b into apache:main Feb 14, 2025
24 checks passed
@jayzhan211
Copy link
Contributor

Thanks @jkosh44

@jkosh44 jkosh44 deleted the array-signatures branch February 14, 2025 14:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate common Related to common crate logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants